Introduction

These are the notes I took while following the Datacamp course on text mining.

Link on Datacamp

The course is quite nice and well structured, and has a lot of practical examples. But it goes quite fast, so I tend to forget the right syntax, or the right function name, used in the examples. So I thought I would put them together in a series of examples that one can read, try and use as a reference.

As a dataset we will use tweets. Since I have not found the original csv files from the course, I decided to download some real tweets myself. This could prove an interesting project, and could give some interesting insights if we download the right tweets. To do this I followed the instructions on these websites

Link on twitter download 1 Link on twitter download 2

Let’s get the tweets

Let’s load the necessary libraries

library("devtools")
library("twitteR")
library("ROAuth")

Now we need to load our keys from a local csv file

secrets <- read.csv("/Users/umberto/Documents/Passwords and Secrets/twitter-keys.csv", stringsAsFactors = FALSE, header = TRUE, sep =",")

api_key <- secrets$api_key
api_secret <- secrets$api_secret
access_token <- secrets$access_token
access_token_secret <- secrets$access_token_secret
 

setup_twitter_oauth(api_key,api_secret)
## [1] "Using browser based authentication"

Coffee Tweets

search.string <- "#coffee"
no.of.tweets <- 1000

c_tweets <- searchTwitter(search.string, n=no.of.tweets, lang="en")

Now we need to access the text of the tweets. We do it this way (we also need to clean the tweets of special characters that we don’t need for now, like emoticons, with the sapply function).

coffee_tweets = sapply(c_tweets, function(t) t$getText())

coffee_tweets <- sapply(coffee_tweets,function(row) iconv(row, "latin1", "ASCII", sub=""))

head(coffee_tweets)
##                                                  #Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up &amp; just do it! https://t.co/8iRHl9Czxl 
##                                                "#Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up &amp; just do it! https://t.co/8iRHl9Czxl" 
##                                                                                    My #coffee and the #rhythm. \xed\xa0\xbc\xed\xbe\xa7\u2615 #lovemyjob #lovemylife https://t.co/cOMZCPYaOM 
##                                                                                                    "My #coffee and the #rhythm.  #lovemyjob #lovemylife https://t.co/cOMZCPYaOM" 
##                                                                                            Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg 
##                                                                                          "Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg" 
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/hFHN8p8tbG 
##                                         "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed  #phoenixvibe https://t.co/hFHN8p8tbG" 
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/mFPmq3sTgN 
##                                         "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed  #phoenixvibe https://t.co/mFPmq3sTgN" 
## Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed ✍\u2615\xed\xa0\xbd\xed\xb8\x87\xed\xa0\xbe\xed\xb4\x98 #phoenixvibe… https://t.co/LcAmh8zBvH 
##                                         "Phoenix #coffee shop #blog @LivingPretty1 Curry Rivel by @Jennykaneauthor who was very impressed  #phoenixvibe https://t.co/LcAmh8zBvH"

It is interesting to see how many fields we get back from the search

str(c_tweets[[1]])
## Reference class 'status' [package "twitteR"] with 17 fields
##  $ text         : chr "#Authors\n#Read \n#Write\nDrink #coffee :)\nCreate\nInspire\nDream Big\nNever give up &amp; just do it! https:/"| __truncated__
##  $ favorited    : logi FALSE
##  $ favoriteCount: num 0
##  $ replyToSN    : chr(0) 
##  $ created      : POSIXct[1:1], format: "2017-05-23 20:15:17"
##  $ truncated    : logi FALSE
##  $ replyToSID   : chr(0) 
##  $ id           : chr "867111712440238080"
##  $ replyToUID   : chr(0) 
##  $ statusSource : chr "<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">Hootsuite</a>"
##  $ screenName   : chr "AlohaIsleCoffee"
##  $ retweetCount : num 0
##  $ isRetweet    : logi FALSE
##  $ retweeted    : logi FALSE
##  $ longitude    : chr(0) 
##  $ latitude     : chr(0) 
##  $ urls         :'data.frame':   0 obs. of  4 variables:
##   ..$ url         : chr(0) 
##   ..$ expanded_url: chr(0) 
##   ..$ dispaly_url : chr(0) 
##   ..$ indices     : num(0) 
##  and 53 methods, of which 39 are  possibly relevant:
##    getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,
##    getLatitude, getLongitude, getReplyToSID, getReplyToSN, getReplyToUID,
##    getRetweetCount, getRetweeted, getRetweeters, getRetweets,
##    getScreenName, getStatusSource, getText, getTruncated, getUrls,
##    initialize, setCreated, setFavoriteCount, setFavorited, setId,
##    setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN,
##    setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
##    setStatusSource, setText, setTruncated, setUrls, toDataFrame,
##    toDataFrame#twitterObj

So there are quite a few possibilities here. But we are not actually interested in the full status objects now, just in the text of the tweets (check for example this stackoverflow post as a reference).

Tea tweets

Since we are going to compare corpora of text, we need a second set of tweets, and, following the example of the course, I decided to download the first 1000 tweets on tea.

Tea Tweets

search.string <- "#tea"
no.of.tweets <- 1000

t_tweets <- searchTwitter(search.string, n=no.of.tweets, lang="en")

Now we need to access the text of the tweets. We do it the same way as before (again cleaning the tweets of special characters, like emoticons, with the sapply function).

tea_tweets = sapply(t_tweets, function(t) t$getText())

tea_tweets <- sapply(tea_tweets,function(row) iconv(row, "latin1", "ASCII", sub=""))

head(tea_tweets)
##                                             #assam #tea and @mcvities richtea - what a perfect partnership!!! #successhour https://t.co/dmDF2GcSc1 
##                                           "#assam #tea and @mcvities richtea - what a perfect partnership!!! #successhour https://t.co/dmDF2GcSc1" 
##                                                               RT @JessicaLSimps: #Tea time!! Love my tea \xed\xa0\xbd\xed\xb2\x9a https://t.co/tB1qzW5yee 
##                                                                              "RT @JessicaLSimps: #Tea time!! Love my tea  https://t.co/tB1qzW5yee" 
##      RT @AsilaAR: Drinking a Full Bottle of Lipton White Raspberry Tea! https://t.co/ySWP7KzhJx via @YouTube\n@Lipton #lipton #tea #health #busin… 
##     "RT @AsilaAR: Drinking a Full Bottle of Lipton White Raspberry Tea! https://t.co/ySWP7KzhJx via @YouTube\n@Lipton #lipton #tea #health #busin" 
##   #Win #international #WithLoveforBooks #Kindle Fire, Amazon #giftcard, #tea, #chocolate, owl mugs &amp; sweater #giveaway https://t.co/ZoznJm1RdA 
## "#Win #international #WithLoveforBooks #Kindle Fire, Amazon #giftcard, #tea, #chocolate, owl mugs &amp; sweater #giveaway https://t.co/ZoznJm1RdA" 
##       RT @RushAntiques: Beautiful Mother of Pearl decorated Tea Caddy, remnants of original tin lining! #teacaddy #teatime #tea https://t.co/xWyR… 
##      "RT @RushAntiques: Beautiful Mother of Pearl decorated Tea Caddy, remnants of original tin lining! #teacaddy #teatime #tea https://t.co/xWyR" 
##                                                            What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha 
##                                                          "What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha"

Let’s start with text mining

To do text mining, one of the most used libraries (and the one I will use here) is tm.

library("tm")

First we need to create a vector source from our tweet texts

coffee_source <- VectorSource(coffee_tweets)
tea_source <- VectorSource(tea_tweets)

Then we need to make a VCorpus of the list of tweets

coffee_corpus <- VCorpus(coffee_source)
tea_corpus <- VCorpus(tea_source)
coffee_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1000

So if we want to see the text of a tweet in the corpus we can use

coffee_corpus[[15]][1]
## $content
## [1] "What #beverage get's you through the work day? \n\n#coffee #tea #water #juice #kombucha"
tea_corpus[[15]][1]
## $content
## [1] "SUMMER IS HERE!\n\nhttps://t.co/JFsHIYWgUm\n\n#VanyaCrafts #etsy #summervibes #sunshine #funinthesun #summer #tea https://t.co/7kbMrLwx00"

Cleaning text

Now that I know how to make a corpus, I can focus on cleaning, or preprocessing, the text. In bag of words text mining, cleaning helps aggregate terms. For example, it may make sense that the words “miner”, “mining” and “mine” should be considered one term. Specific preprocessing steps will vary based on the project. For example, the words used in tweets are vastly different from those used in legal documents, so the cleaning process can also be quite different. (Text Source: Datacamp)

From Data Source

Common preprocessing functions include:

  • tolower(): Make all characters lowercase
  • removePunctuation(): Remove all punctuation marks
  • removeNumbers(): Remove numbers
  • stripWhitespace(): Remove excess whitespace

Note that tolower() is part of base R, while the other three functions come from the tm package. Going forward, we’ll load the tm and qdap packages when they are needed.

The qdap package offers other text cleaning functions. Each is useful in its own way and is particularly powerful when combined with the others.

  • bracketX(): Remove all text within brackets (e.g. “It’s (so) cool” becomes “It’s cool”)
  • replace_number(): Replace numbers with their word equivalents (e.g. “2” becomes “two”)
  • replace_abbreviation(): Replace abbreviations with their full text equivalents (e.g. “Sr” becomes “Senior”)
  • replace_contraction(): Convert contractions back to their base words (e.g. “shouldn’t” becomes “should not”)
  • replace_symbol(): Replace common symbols with their word equivalents (e.g. “$” becomes “dollar”)
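To get a feel for these, here is a small sketch on made-up strings (qdap needs rJava, and its replacement functions rely on the package’s default dictionaries, so the exact output may vary):

```r
library(qdap)

bracketX("It's (so) cool")                 # drops the bracketed text
replace_number("Drank 2 cups")             # "2" becomes "two"
replace_abbreviation("Jones Sr was here")  # "Sr" becomes "Senior"
replace_contraction("I shouldn't")         # "shouldn't" becomes "should not"
replace_symbol("It costs $5")              # "$" becomes "dollar"
```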

Stopwords

Using the c() function allows you to add new words (separated by commas) to the stop words list. For example, the following would add “word1” and “word2” to the default list of English stop words:

all_stops <- c("word1", "word2", stopwords("en"))

You can use the following command to remove stopwords

removeWords(text, stopwords("en"))
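Putting the two together, a small sketch on a made-up sentence: removeWords() replaces each stop word with an empty string, so it is usually followed by stripWhitespace() to tidy the leftover spaces.

```r
library(tm)

text <- "the coffee shop on the corner serves coffee all day"

# extend the default English stop word list with two custom words
all_stops <- c("word1", "word2", stopwords("en"))

no_stops <- removeWords(text, all_stops)  # stop words become ""
stripWhitespace(no_stops)                 # collapse the leftover spaces
```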

Stemming

Here is an example of stemming

stemDocument(c("computational", "computers", "computation"))
## [1] "comput" "comput" "comput"

Here is an example of using stemming

# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
# Create the completion dictionary: comp_dict
comp_dict <- "complicate"
# Perform stem completion: complete_text 
complete_text <- stemCompletion(stem_doc, comp_dict)
# Print complete_text
complete_text
##      complic      complic      complic 
## "complicate" "complicate" "complicate"

Clean the Corpus

To clean the corpus we can define a function that applies several cleaning steps in sequence

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "mug", "coffee"))
  return(corpus)
}

Then we can try to apply it on our corpus

clean_corp <- clean_corpus(coffee_corpus)

Then we can print a cleaned-up tweet

clean_corp[[227]][1]
## $content
## [1] "stainless steel    550ml  1354 httpstcogye0unwtae via kitchenhandle httpstcotptjmwxkfd"

and the original one

coffee_corpus[[227]][1]
## $content
## [1] "Stainless Steel #Coffee Mug - 550ml, only $13.54 https://t.co/Gye0UNWTaE via @kitchenhandle https://t.co/tpTJmwXKfD"

So we have removed special characters, punctuation and so on. Not all the remaining words make much sense (for example Twitter usernames), but that should not be a problem since we don’t expect to see them very often in our corpus.

Make a document-term matrix

We can use the following code to make a DTM. Each document is represented as a row and each word as a column.

coffee_dtm <- DocumentTermMatrix(clean_corp)

# Print out coffee_dtm data
print(coffee_dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 4026)>>
## Non-/sparse entries: 9146/4016854
## Sparsity           : 100%
## Maximal term length: 63
## Weighting          : term frequency (tf)
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)

# Print the dimensions of coffee_m
dim(coffee_m)
## [1] 1000 4026
# Review a portion of the matrix
coffee_m[148:150, 2587: 2590]
##      Terms
## Docs  mandobarista mandobaristadaily mango mannat
##   148            0                 0     0      0
##   149            0                 0     0      0
##   150            0                 0     0      0

Make a term-document matrix (TDM)

You can also transpose the DTM into a term-document matrix (TDM), which has each word as a row and each document as a column.

# Create a TDM from clean_corp: coffee_tdm
coffee_tdm <- TermDocumentMatrix(clean_corp)

# Print coffee_tdm data
print(coffee_tdm)
## <<TermDocumentMatrix (terms: 4026, documents: 1000)>>
## Non-/sparse entries: 9146/4016854
## Sparsity           : 100%
## Maximal term length: 63
## Weighting          : term frequency (tf)
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)

# Print the dimensions of the matrix
dim(coffee_m)
## [1] 4026 1000
# Review a portion of the matrix
coffee_m[2587:2590, 148:150]
##                    Docs
## Terms               148 149 150
##   mandobarista        0   0   0
##   mandobaristadaily   0   0   0
##   mango               0   0   0
##   mannat              0   0   0

Frequent terms with tm

(source Datacamp) Now that you know how to make a term-document matrix, as well as its transpose, the document-term matrix, we will use it as the basis for some analysis. In order to analyze it we need to change it to a simple matrix like we did in chapter 1 using as.matrix.

Calling rowSums() on your newly made matrix aggregates all the terms used in a passage. Once you have the rowSums(), you can sort() them with decreasing = TRUE, so you can focus on the most common terms.

Lastly, you can make a barplot() of the top 5 terms of term_frequency with the following code.

barplot(term_frequency[1:5], col = "#C0DE25")

Of course, you could take our ggplot2 course to learn how to customize the plot even more… :)

So let’s try with our coffee tweets

## coffee_tdm is still loaded in your workspace

# Create a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)

# Calculate the rowSums: term_frequency
term_frequency <- rowSums(coffee_m)

# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing = TRUE)

# View the top 10 most common words
term_frequency[1:10]
##     cup    love     amp     day morning     can     via     one   drink 
##      86      62      59      55      48      47      46      44      41 
##   great 
##      40
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)

Now let’s make it a bit prettier with ggplot2

library(ggplot2)
library(dplyr)

tf <- as.data.frame(term_frequency)
tf$words <- row.names(tf)
tf10 <- as.data.frame(tf[1:10,])

# We need to make the words factors (ordered) otherwise ggplot2 will order the 
# x axis alphabetically
tf10 <- mutate(tf10, words = factor(words, words))

ggplot(tf10, aes(x = words, y = term_frequency)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")

Note that the following commands don’t work from RStudio if you want to use knitr. So the solution is to run them from the console.

The command will render an html file in the directory where the Rmd file is.

library(rJava)
library(qdap)

Let’s build a word frequency plot with the qdap library

frequency <- freq_terms(coffee_tweets, top = 10, at.least = 3, stopwords = "Top200Words")

frequency <- mutate(frequency, WORD = factor(WORD, WORD))

ggplot(frequency, aes(x = WORD, y = FREQ)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")

Now let’s remove more stopwords

frequency2 <- freq_terms(coffee_tweets, top = 10, at.least = 3, stopwords = tm::stopwords("english"))

frequency2 <- mutate(frequency2, WORD = factor(WORD, WORD))

ggplot(frequency2, aes(x = WORD, y = FREQ)) +
  geom_bar(stat = "identity", fill = "tan", col = "black") +
  theme_grey() +
  theme(text = element_text(size = 16),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ylab("Words Frequency")

Wordclouds

library(wordcloud)
term_frequency[1:10]
##     cup    love     amp     day morning     can     via     one   drink 
##      86      62      59      55      48      47      46      44      41 
##   great 
##      40
word_freqs <- data.frame(term = names(term_frequency), num = term_frequency)
wordcloud(word_freqs$term, word_freqs$num, max.words = 100, colors = "red")
## Warning in wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
## colors = "red"): tuesdaythoughts could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
## colors = "red"): coffeelover could not be fit on page. It will not be
## plotted.

Now we need to remove some words that clearly appear whenever people talk about coffee

# Add new stop words to clean_corpus()
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, 
                   c(stopwords("en"), "brew", "cafe", "coffeetime", "cup", "coffee"))
  return(corpus)
}

clean_coffee <- clean_corpus(coffee_corpus)
coffee_tdm <- TermDocumentMatrix(clean_coffee)
coffee_m <- as.matrix(coffee_tdm)
coffee_words <- rowSums(coffee_m)

Now we prepare the right order of words for the wordcloud

coffee_words <- sort(coffee_words, decreasing = TRUE)
coffee_words[1:6]
##    love     amp     day morning     can     via 
##      62      59      55      48      47      46
coffee_freqs <- data.frame(term = names(coffee_words), num = coffee_words)

wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, colors = "red")
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## amp could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## cbpcindy could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 50, :
## problem could not be fit on page. It will not be plotted.

Improve word colours

wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = c("grey80", "darkgoldenrod1", "tomato"))
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeeaddict could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : httpstcopqxaevqg could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : love could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : caramiasg could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : great could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeelovers could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : suziday could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : time could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : day could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : queenbeancoffee could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : elfortney could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : good could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : problem could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : amp could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : selection could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : winewankers could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : like could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : winelover could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : freshroasters could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : shop could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeelover could not be fit on page. It will not be plotted.

RColorBrewer color schemes are organized into three categories:

  • Sequential: Colors ascend from light to dark in sequence
  • Qualitative: Colors are chosen for their pleasing qualities together
  • Diverging: Colors have two distinct color spectra with lighter colors in between

To change the colors parameter of the wordcloud() function you can select a palette from RColorBrewer such as “Greens”. The function display.brewer.all() will list all predefined color palettes. More information on ColorBrewer (the framework behind RColorBrewer) is available on its website.

(Source: datacamp)

The function brewer.pal() allows you to select colors from a palette. Specify the number of distinct colors needed (e.g. 8) and the predefined palette to select from (e.g. “Greens”). Often in word clouds, very faint colors are washed out so it may make sense to remove the first couple from a brewer.pal() selection, leaving only the darkest.

Here’s an example:

green_pal <- brewer.pal(8, "Greens")
green_pal <- green_pal[-(1:2)]

Then just add that object to the wordcloud() function.

wordcloud(chardonnay_freqs$term, chardonnay_freqs$num, max.words = 100, colors = green_pal)

(Source: datacamp)

The command display.brewer.all() will display all the palettes; it is a very cool command.

display.brewer.all()

Let’s try to use the PuOr palette

# Create purple_orange
PuOr <- brewer.pal(10, "PuOr")
purple_orange <- PuOr[-(1:2)]

And now we can create the wordcloud with this palette

wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = purple_orange)
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeelover could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : coffeelovers could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : love could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : need could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : wine could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : goodmorning could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : shop could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : espresso could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : mug could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : https could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : freshroasters could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : amp could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : httpstcopqxaevqg could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : tuesdaythoughts could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : httpstcomzmgfib could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : caffeine could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : make could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : choose could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : morning could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : delicious could not be fit on page. It will not be plotted.
## Warning in wordcloud(coffee_freqs$term, coffee_freqs$num, max.words =
## 100, : elfortney could not be fit on page. It will not be plotted.

Sometimes not all the words can be plotted. In this case the only solutions are to reduce the number of words or to reduce the scale of the words themselves. For example

wordcloud(coffee_freqs$term, coffee_freqs$num, max.words = 100, colors = purple_orange, scale = c(2,0.3))

Now all the words are in the plots.

Wordclouds with bigrams

Sometimes single words don’t tell the entire story, and it is interesting to do the same plot with bigrams (pairs of words that appear together in the corpus). The tokenizer from the RWeka package is very useful here.

library(RWeka)

Then we need to get the pairs of words (note that the tokenizer defined below will give you only bigrams, not single words anymore).

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(coffee_corpus, control = list(tokenize = BigramTokenizer))

Then we can get the frequencies of the bigrams

freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq= freq)
head(freq.df)
##                        word freq
## https //t         https //t 1011
## #coffee https #coffee https  139
## the solution   the solution   38
## cup of               cup of   37
## of #coffee       of #coffee   36
## a #coffee         a #coffee   34

Now we can plot the wordcloud

wordcloud(freq.df$word, freq.df$freq, max.words = 50, random.order = F, colors = purple_orange, scale = c(4,0.7))

Of course we should first do a cleanup of the bigram list, but that goes beyond these notes. An important point is that if you remove stop words like “not” you may lose important information for bigrams (like negations).
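A tiny sketch of that caveat, on a made-up review: once “not” is removed as a stop word, the bigram “not good” can never be formed, and the negation is lost.

```r
library(tm)

review <- "this coffee is not good"

# "not" is in the standard English stop word list, so after removal
# only "coffee" and "good" survive: a bigram tokenizer would then see
# the review as if it were positive
removeWords(review, stopwords("en"))
```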

Trigrams

Just as a reference here is the code to do wordclouds with trigrams

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.trigram <- TermDocumentMatrix(coffee_corpus, control = list(tokenize= TrigramTokenizer))

freq <- sort(rowSums(as.matrix(tdm.trigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq= freq)
head(freq.df)
##                                            word freq
## #coffee https //t             #coffee https //t  134
## cup of #coffee                   cup of #coffee   24
## coffee #coffee https       coffee #coffee https   21
## drink more coffee             drink more coffee   19
## https //t co/pqxa5ev8qg https //t co/pqxa5ev8qg   19
## is quite simple                 is quite simple   19

Common Words between Corpora

To find common words we need to create two “big” documents of tweets: we collapse all tweets together, separated by a space.

all_coffee <- paste(coffee_tweets, collapse = " ")
all_tea <- paste(tea_tweets, collapse = " ")
all_tweets <- c(all_coffee, all_tea)

Now we convert to a Corpus

# Convert to a vector source
all_tweets <- VectorSource(all_tweets)

# Create all_corpus
all_corpus <- VCorpus(all_tweets)

Now that we have a corpus filled with words used in both the tea and coffee tweets, we can clean the corpus, convert it into a TermDocumentMatrix, and then into a matrix to prepare it for a commonality.cloud(). First we need to define a proper cleaning function that also removes the words “coffee” and “tea” themselves.

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "mug", "coffee", "tea"))
  return(corpus)
}

Let’s clean the corpus

# Clean the corpus
all_clean <- clean_corpus (all_corpus)

# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean) 

# Create all_m
all_m <- as.matrix(all_tdm)

Now the commonality cloud

commonality.cloud(all_m, max.words = 100, colors = "steelblue1")

Comparison Cloud

You can plot a comparison cloud in this way

comparison.cloud(all_m, max.words = 50, colors = c("orange", "blue"), scale = c(3,0.5))

(Source Datacamp) A commonality.cloud() may be misleading since words could be represented disproportionately in one corpus or the other, even if they are shared. In the commonality cloud, they would show up without telling you which one of the corpora has more term occurrences.

To solve this problem, we can create a pyramid.plot() from the plotrix package.

library(plotrix)
all_tdm_m <- all_m
# Create common_words
common_words <- subset(all_tdm_m, all_tdm_m[, 1] > 0 & all_tdm_m[, 2] > 0)

# Create difference
difference <- abs(common_words[, 1] - common_words[, 2])

# Combine common_words and difference
common_words <- cbind(common_words, difference)

# Order the data frame from most differences to least
common_words <- common_words[order(common_words[, 3], decreasing = TRUE), ]

# Create top25_df
top25_df <- data.frame(x = common_words[1:25, 1], 
                       y = common_words[1:25, 2], 
                       labels = rownames(common_words[1:25, ]))

# Create the pyramid plot
pyramid.plot(top25_df$x, top25_df$y,
             labels = top25_df$labels, gap = 60,
             top.labels = c("Coffee", "Words", "Tea"),
             main = "Words in Common", laxlab = NULL, 
             raxlab = NULL, unit = NULL)

## [1] 5.1 4.1 4.1 2.1

Word Networks

In a network graph, the circles are called nodes and represent individual terms, while the lines connecting the circles are called edges and represent the connections between the terms.

For the over-caffeinated text miner, qdap provides a shortcut for making word networks. The word_network_plot() and word_associate() functions both make word networks easy!

word_associate(coffee_tweets, match.string = c("monday"), 
               stopwords = c(Top200Words, "coffee", "mug"), 
               network.plot = TRUE)
## Warning in text2color(words = V(g)$label, recode.words = target.words,
## colors = label.colors): length of colors should be 1 more than length of
## recode.words

##   row group unit text                                                                                                                                       
## 1  51   all   51 RT @suziday123: @gigirules7 Monday #coffee Morning #letsdothis #MondayMotivaton @CaraMiaSG @MagnumExotics @Coffee_and_Bean https://t       
## 2  62   all   62 Alto Astral #bemorehuman #monday #saturday #goodmorning #overheadsquat #snatch #coffee https://t.co/gUp677aGDp                             
## 3 477   all  477 RT @badolinaLDN: Monday morning... Our #coffee is so strong it wakes up the neighbours! (sorry neighbours....) https://t.co/GjgqbKMqmD     
## 4 748   all  748 When Tuesday after a long weekend still feels like Monday #parenting #coffee #CoffeeAddict #preschooler #coffeetime https://t.co/od8zLwjjdb
## 
## Match Terms
## ===========
## 
## List 1:
## monday, mondaymotivaton
## 

Distance Matrix and Dendrograms

Now that you understand the steps in making a dendrogram, you can apply them to text. But first, you have to limit the number of words in your TDM using removeSparseTerms() from tm. Why would you want to adjust the sparsity of the TDM/DTM?

TDMs and DTMs are sparse, meaning they contain mostly zeros. Remember that 1000 tweets can become a TDM with over 3000 terms! You won’t be able to easily interpret a dendrogram that is so cluttered, especially if you are working on more text.

A good TDM has between 25 and 70 terms. The sparse value is a percentage cutoff of zeros allowed for each term in the TDM: the lower the value, the fewer terms are kept; the closer it is to 1, the more terms survive.
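To see what the sparse cutoff does, here is a small base-R sketch that mimics the rule removeSparseTerms() applies (a term survives only if the fraction of documents in which it is absent stays below sparse); the toy counts are invented for illustration:

```r
# Toy term-document matrix: 4 terms x 10 documents (counts invented)
tdm_toy <- rbind(
  coffee = c(1, 2, 1, 0, 1, 1, 0, 1, 1, 1),  # absent in 2/10 docs
  mug    = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),  # absent in 9/10 docs
  love   = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0),  # absent in 5/10 docs
  rare   = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)   # absent in 9/10 docs
)

sparse <- 0.5
# Sparsity of a term = fraction of documents where its count is zero
sparsity <- rowMeans(tdm_toy == 0)
# Keep terms strictly below the cutoff (roughly what removeSparseTerms does)
rownames(tdm_toy)[sparsity < sparse]
```

With sparse = 0.5 only "coffee" survives; raising sparse towards 1 lets sparser terms back in. Tweets are short, so on the real data almost every term is extremely sparse, which explains the drastic reduction from 3950 to 3 terms at sparse = 0.95 below.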

Let’s see the dimensions of our coffee TDM

dim(coffee_tdm)
## [1] 3950 1000

Let’s remove some terms

coffee_tdm1 <- removeSparseTerms(coffee_tdm, sparse = 0.95)
dim(coffee_tdm1)
## [1]    3 1000

Let’s see a dendrogram now

coffee_tdm1_m <- as.matrix(coffee_tdm1)
coffee_tdm1_df <- as.data.frame(coffee_tdm1_m)
coffee_dist <- dist(coffee_tdm1_df)

coffee_hc <- hclust(coffee_dist)
plot(coffee_hc)

Now let’s make the dendrogram more appealing

library(dendextend)

Now

hcd <- as.dendrogram(coffee_hc)
labels(hcd)
## [1] "love" "amp"  "day"

Now let’s work on the appearance

hcd <- branches_attr_by_labels(hcd, c("mondaymorning", "work"), "red")
## Warning in branches_attr_by_labels(hcd, c("mondaymorning", "work"), "red"): Not all of the labels you provided are included in the dendrogram.
## The following labels were omitted:mondaymorningwork
plot(hcd, main = "Better Dendrogram")

Now let’s add rectangular shapes around the clusters

# Add cluster rectangles 
plot(hcd, main = "Better Dendrogram")
rect.dendrogram(hcd, k = 2, border = "grey50")
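rect.dendrogram() only draws the boxes; to extract the actual cluster membership you can use base R's cutree() with the same k (shown here on a toy distance matrix rather than the coffee data):

```r
# Toy data: two well-separated groups of 2D points
pts <- rbind(a = c(0, 0), b = c(0.2, 0.1), c = c(5, 5), d = c(5.1, 4.9))

hc <- hclust(dist(pts))
cl <- cutree(hc, k = 2)   # same k as rect.dendrogram(hcd, k = 2)
cl
```

Points a and b end up in one cluster, c and d in the other.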

Word Associations

Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM or DTM. Scores range from 0 to 1. A score of 1 means that two words always appear together, while a score of 0 means that they never appear together.

To use findAssocs() pass in a TDM or DTM, the search term, and a minimum correlation. The function will return a list of all other terms that meet or exceed the minimum threshold.
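Under the hood, findAssocs() reports the Pearson correlation between the per-document count vectors of two terms; a minimal base-R sketch with invented counts:

```r
# Invented counts of two terms across 6 documents
mug     <- c(2, 0, 1, 0, 3, 0)
ceramic <- c(1, 0, 1, 0, 2, 0)

# This is the kind of score findAssocs would report (rounded to 2 digits)
round(cor(mug, ceramic), 2)
```

The two terms tend to appear in the same documents, so the correlation is close to 1; terms that never co-occur score near 0.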

# General form: findAssocs(tdm, "word", 0.25)
# Create associations
associations <- findAssocs(coffee_tdm, "mug", 0.2)

# View the mug associations
print(associations)
## $mug
##            ceramic            fathers           deserves 
##               0.42               0.37               0.34 
##         epiconetsy       etsychaching httpstcoiwovyllqjz 
##               0.34               0.34               0.34 
##           birthday        daydrinking       disappearing 
##               0.32               0.32               0.32 
##             hockey  httpstcocgzitfizv   httpstcohcqgsyyv 
##               0.32               0.32               0.32 
## httpstcoqzukkwrhnp       morningkraze           morphing 
##               0.32               0.32               0.32 
##           silicone  whiskeyandwhineco             batman 
##               0.32               0.32               0.28 
##               june               away            battery 
##               0.27               0.22               0.22 
##       coffeelovers              color            designs 
##               0.22               0.22               0.22 
##               etsy              funny              retro 
##               0.22               0.22               0.22 
##          sensitive             travel               blue 
##               0.22               0.22               0.21 
##           creative      kitchenhandle 
##               0.21               0.21
library(ggthemes)

# Create associations_df
associations_df <- list_vect2df(associations)[,2:3]

# Plot the associations_df values (don't change this)
ggplot(associations_df, aes(y = associations_df[, 1])) + 
  geom_point(aes(x = associations_df[, 2]), 
             data = associations_df, size = 3) + 
  theme_gdocs()

Similarity matrix

require(proxy)

coffee_tdm_m <- as.matrix(coffee_tdm)

coffee_cosine_dist_mat <- as.matrix(dist(coffee_tdm_m, method = "cosine"))

What dimensions does this matrix have?

dim(coffee_cosine_dist_mat)
## [1] 3950 3950

As expected: one row and one column per term. Let’s check some rows

coffee_cosine_dist_mat[1:5,1:5]
##              abroad abwbhlucas accessory account acneskinsite
## abroad            0          1         1       1            1
## abwbhlucas        1          0         1       1            1
## accessory         1          1         0       1            1
## account           1          1         1       0            1
## acneskinsite      1          1         1       1            0
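The cosine distance that proxy computes is 1 minus the cosine similarity; for two term vectors it is easy to reproduce by hand in base R (the vectors below are invented):

```r
cosine_dist <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Terms that never co-occur have orthogonal count vectors: distance 1
cosine_dist(c(1, 0, 0), c(0, 1, 0))

# Identical vectors: distance 0
cosine_dist(c(1, 2, 3), c(1, 2, 3))
```

This explains the matrix above: the five terms shown never appear in the same tweet, so all off-diagonal distances are 1.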

We can do the same calculation much more efficiently by exploiting the sparse (simple triplet matrix) representation. Note that crossprod operates on the columns, so this time we get a document-document cosine similarity matrix (despite the variable name, these are similarities, not distances)

library(slam)
cosine_dist_mat <- crossprod_simple_triplet_matrix(coffee_tdm)/(sqrt(col_sums(coffee_tdm^2) %*% t(col_sums(coffee_tdm^2))))
cosine_dist_mat[1:5,1:5]
##     Docs
## Docs 1 2 3   4   5
##    1 1 0 0 0.0 0.0
##    2 0 1 0 0.0 0.0
##    3 0 0 1 0.0 0.0
##    4 0 0 0 1.0 0.9
##    5 0 0 0 0.9 1.0

Tweets 4 and 5 have a similarity score of 0.9, a very high one (most likely one is a retweet or near-duplicate of the other). Tweets 2 and 3, on the other hand, score 0. Let’s check them

print(coffee_tweets[[2]])
## [1] "My #coffee and the #rhythm.  #lovemyjob #lovemylife https://t.co/cOMZCPYaOM"
print(coffee_tweets[[3]])
## [1] "Tuesday Tunes: Dan Fogleberg https://t.co/h8rG74RKVj #coffee #TuesdayTunes #Fogleberg"

As expected, they have nothing in common apart from the #coffee hashtag.

Bag of words

my.tdm <- TermDocumentMatrix(coffee_corpus, control = list(weighting = weightTfIdf))
my.dtm <- DocumentTermMatrix(coffee_corpus, control = list(weighting = weightTfIdf, stopwords = TRUE))
inspect(my.dtm)
## <<DocumentTermMatrix (documents: 1000, terms: 4678)>>
## Non-/sparse entries: 10650/4667350
## Sparsity           : 100%
## Maximal term length: 73
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##      Terms
## Docs  &amp;    #coffee #coffeelover #wine can coffee cup drink need via
##   10      0 0.00000000            0     0   0      0   0     0    0   0
##   315     0 0.00000000            0     0   0      0   0     0    0   0
##   509     0 0.00000000            0     0   0      0   0     0    0   0
##   560     0 0.00000000            0     0   0      0   0     0    0   0
##   721     0 0.00000000            0     0   0      0   0     0    0   0
##   728     0 0.00000000            0     0   0      0   0     0    0   0
##   78      0 0.00000000            0     0   0      0   0     0    0   0
##   800     0 0.01293925            0     0   0      0   0     0    0   0
##   807     0 0.00000000            0     0   0      0   0     0    0   0
##   82      0 0.00000000            0     0   0      0   0     0    0   0

Let’s look (for example) for terms with a total weight of at least 200. Since my.tdm is tf-idf weighted (all values well below 1), no term comes close

findFreqTerms(my.tdm, 200)
## character(0)
cosine_dist_mat <- crossprod_simple_triplet_matrix(my.tdm)/(sqrt(col_sums(my.tdm^2) %*% t(col_sums(my.tdm^2))))
cosine_dist_mat[1:5,1:5]
##     Docs
## Docs            1            2            3            4            5
##    1 1.000000e+00 5.174231e-05 4.111195e-05 3.638475e-05 3.638475e-05
##    2 5.174231e-05 1.000000e+00 6.381586e-05 5.647809e-05 5.647809e-05
##    3 4.111195e-05 6.381586e-05 1.000000e+00 4.487477e-05 4.487477e-05
##    4 3.638475e-05 5.647809e-05 4.487477e-05 1.000000e+00 8.798005e-01
##    5 3.638475e-05 5.647809e-05 4.487477e-05 8.798005e-01 1.000000e+00
y <- which(cosine_dist_mat > 0.5, arr.ind = TRUE)
str(y)
##  int [1:2744, 1:2] 1 2 3 4 5 6 4 5 6 4 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2744] "1" "2" "3" "4" ...
##   ..$ : chr [1:2] "Docs" "Docs"
y
##      Docs Docs
## 1       1    1
## 2       2    2
## 3       3    3
## 4       4    4
## 5       5    4
## 6       6    4
## 4       4    5
## 5       5    5
## 6       6    5
## 4       4    6
## 5       5    6
## 6       6    6
## 7       7    7
## 81     81    7
## 152   152    7
## 8       8    8
## 9       9    9
## 10     10   10
## 11     11   11
## 12     12   12
## 27     27   12
## 50     50   12
## 97     97   12
## 102   102   12
## 113   113   12
## 115   115   12
## 132   132   12
## 217   217   12
## 335   335   12
## 644   644   12
## 716   716   12
## 815   815   12
## 843   843   12
## 942   942   12
## 996   996   12
## 13     13   13
## 14     14   14
## 15     15   15
## 16     16   16
## 37     37   16
## 368   368   16
## 689   689   16
## 17     17   17
## 220   220   17
## 18     18   18
## 19     19   19
## 144   144   19
## 145   145   19
## 20     20   20
## 21     21   21
## 403   403   21
## 851   851   21
## 916   916   21
## 931   931   21
## 22     22   22
## 206   206   22
## 461   461   22
## 550   550   22
## 776   776   22
## 803   803   22
## 936   936   22
## 23     23   23
## 24     24   23
## 23     23   24
## 24     24   24
## 25     25   25
## 26     26   26
## 12     12   27
## 27     27   27
## 50     50   27
## 97     97   27
## 102   102   27
## 113   113   27
## 115   115   27
## 132   132   27
## 217   217   27
## 335   335   27
## 644   644   27
## 716   716   27
## 815   815   27
## 843   843   27
## 942   942   27
## 996   996   27
## 28     28   28
## 380   380   28
## 863   863   28
## 29     29   29
## 535   535   29
## 953   953   29
## 30     30   30
## 329   329   30
## 31     31   31
## 137   137   31
## 245   245   31
## 32     32   32
## 33     33   33
## 34     34   34
## 35     35   35
## 36     36   36
## 16     16   37
## 37     37   37
## 368   368   37
## 689   689   37
## 38     38   38
## 676   676   38
## 39     39   39
## 40     40   40
## 41     41   41
## 42     42   42
## 43     43   43
## 44     44   44
## 45     45   45
## 80     80   45
## 46     46   46
## 47     47   47
## 124   124   47
## 229   229   47
## 296   296   47
## 355   355   47
## 476   476   47
## 481   481   47
## 572   572   47
## 641   641   47
## 971   971   47
## 997   997   47
## 48     48   48
## 49     49   49
## 12     12   50
## 27     27   50
## 50     50   50
## 97     97   50
## 102   102   50
## 113   113   50
## 115   115   50
## 132   132   50
## 217   217   50
## 335   335   50
## 644   644   50
## 716   716   50
## 815   815   50
## 843   843   50
## 942   942   50
## 996   996   50
## 51     51   51
## 52     52   52
## 53     53   53
## 54     54   54
## 55     55   55
## 597   597   55
## 56     56   56
## 57     57   57
## 58     58   58
## 482   482   58
## 774   774   58
## 958   958   58
## 987   987   58
## 59     59   59
## 60     60   60
## 61     61   61
## 62     62   62
## 63     63   63
## 64     64   64
## 71     71   64
## 73     73   64
## 76     76   64
## 333   333   64
## 358   358   64
## 511   511   64
## 819   819   64
## 824   824   64
## 65     65   65
## 529   529   65
## 846   846   65
## 926   926   65
## 66     66   66
## 67     67   67
## 68     68   68
## 69     69   69
## 371   371   69
## 848   848   69
## 924   924   69
## 70     70   70
## 574   574   70
## 790   790   70
## 809   809   70
## 832   832   70
## 841   841   70
## 935   935   70
## 960   960   70
## 64     64   71
## 71     71   71
## 73     73   71
## 76     76   71
## 333   333   71
## 358   358   71
## 511   511   71
## 819   819   71
## 824   824   71
## 72     72   72
## 74     74   72
## 64     64   73
## 71     71   73
## 73     73   73
## 76     76   73
## 333   333   73
## 358   358   73
## 511   511   73
## 819   819   73
## 824   824   73
## 72     72   74
## 74     74   74
## 75     75   75
## 64     64   76
## 71     71   76
## 73     73   76
## 76     76   76
## 333   333   76
## 358   358   76
## 511   511   76
## 819   819   76
## 824   824   76
## 77     77   77
## 78     78   78
## 79     79   79
## 86     86   79
## 45     45   80
## 80     80   80
## 7       7   81
## 81     81   81
## 152   152   81
## 82     82   82
## 83     83   83
## 84     84   84
## 85     85   85
## 79     79   86
## 86     86   86
## 87     87   87
## 88     88   88
## 741   741   88
## 89     89   89
## 90     90   90
## 91     91   91
## 138   138   91
## 307   307   91
## 92     92   92
## 458   458   92
## 715   715   92
## 979   979   92
## 93     93   93
## 94     94   94
## 719   719   94
## 975   975   94
## 995   995   94
## 95     95   95
## 96     96   96
## 12     12   97
## 27     27   97
## 50     50   97
## 97     97   97
## 102   102   97
## 113   113   97
## 115   115   97
## 132   132   97
## 217   217   97
## 335   335   97
## 644   644   97
## 716   716   97
## 815   815   97
## 843   843   97
## 942   942   97
## 996   996   97
## 98     98   98
## 99     99   99
## 549   549   99
## 100   100  100
## 101   101  101
## 12     12  102
## 27     27  102
## 50     50  102
## 97     97  102
## 102   102  102
## 113   113  102
## 115   115  102
## 132   132  102
## 217   217  102
## 335   335  102
## 644   644  102
## 716   716  102
## 815   815  102
## 843   843  102
## 942   942  102
## 996   996  102
## 103   103  103
## 104   104  104
## 431   431  104
## 495   495  104
## 854   854  104
## 105   105  105
## 838   838  105
## 106   106  106
## 114   114  106
## 107   107  107
## 907   907  107
## 108   108  108
## 109   109  109
## 517   517  109
## 829   829  109
## 929   929  109
## 110   110  110
## 111   111  111
## 112   112  112
## 12     12  113
## 27     27  113
## 50     50  113
## 97     97  113
## 102   102  113
## 113   113  113
## 115   115  113
## 132   132  113
## 217   217  113
## 335   335  113
## 644   644  113
## 716   716  113
## 815   815  113
## 843   843  113
## 942   942  113
## 996   996  113
## 106   106  114
## 114   114  114
## 12     12  115
## 27     27  115
## 50     50  115
## 97     97  115
## 102   102  115
## 113   113  115
## 115   115  115
## 132   132  115
## 217   217  115
## 335   335  115
## 644   644  115
## 716   716  115
## 815   815  115
## 843   843  115
## 942   942  115
## 996   996  115
## 116   116  116
## 117   117  117
## 614   614  117
## 664   664  117
## 118   118  118
## 119   119  119
## 120   120  120
## 424   424  120
## 121   121  121
## 596   596  121
## 122   122  122
## 123   123  123
## 47     47  124
## 124   124  124
## 229   229  124
## 296   296  124
## 355   355  124
## 476   476  124
## 481   481  124
## 572   572  124
## 641   641  124
## 971   971  124
## 997   997  124
## 125   125  125
## 126   126  126
## 576   576  126
## 796   796  126
## 127   127  127
## 331   331  127
## 128   128  128
## 239   239  128
## 274   274  128
## 129   129  129
## 130   130  130
## 141   141  130
## 131   131  131
## 12     12  132
## 27     27  132
## 50     50  132
## 97     97  132
## 102   102  132
## 113   113  132
## 115   115  132
## 132   132  132
## 217   217  132
## 335   335  132
## 644   644  132
## 716   716  132
## 815   815  132
## 843   843  132
## 942   942  132
## 996   996  132
## 133   133  133
## 134   134  134
## 165   165  134
## 434   434  134
## 857   857  134
## 964   964  134
## 135   135  135
## 136   136  136
## 31     31  137
## 137   137  137
## 245   245  137
## 91     91  138
## 138   138  138
## 307   307  138
## 139   139  139
## 140   140  140
## 130   130  141
## 141   141  141
## 142   142  142
## 143   143  143
## 19     19  144
## 144   144  144
## 145   145  144
## 19     19  145
## 144   144  145
## 145   145  145
## 146   146  146
## 147   147  147
## 830   830  147
## 148   148  148
## 278   278  148
## 657   657  148
## 787   787  148
## 954   954  148
## 149   149  149
## 150   150  150
## 151   151  151
## 7       7  152
## 81     81  152
## 152   152  152
## 153   153  153
## 154   154  154
## 155   155  155
## 156   156  156
## 157   157  157
## 158   158  157
## 157   157  158
## 158   158  158
## 159   159  159
## 160   160  160
## 161   161  161
## 162   162  162
## 163   163  163
## 164   164  164
## 134   134  165
## 165   165  165
## 434   434  165
## 857   857  165
## 964   964  165
## 166   166  166
## 167   167  167
## 168   168  168
## 215   215  168
## 652   652  168
## 760   760  168
## 169   169  169
## 170   170  170
## 171   171  171
## 944   944  171
## 172   172  172
## 268   268  172
## 912   912  172
## 173   173  173
## 174   174  174
## 175   175  175
## 441   441  175
## 449   449  175
## 176   176  176
## 177   177  177
## 178   178  178
## 179   179  179
## 180   180  179
## 179   179  180
## 180   180  180
## 181   181  181
## 855   855  181
## 182   182  182
## 203   203  182
## 183   183  183
## 184   184  184
## 185   185  185
## 209   209  185
## 186   186  186
## 187   187  187
## 445   445  187
## 188   188  188
## 189   189  189
## 190   190  190
## 194   194  190
## 252   252  190
## 262   262  190
## 265   265  190
## 289   289  190
##  [ reached getOption("max.print") -- omitted 2244 rows ]
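which(..., arr.ind = TRUE) returns the row and column of every matching entry instead of a linear index; a minimal base-R illustration on an invented similarity matrix:

```r
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.1,
                0.2, 0.1, 1.0), nrow = 3, byrow = TRUE)

# Row/column pairs of all entries above the threshold
idx <- which(sim > 0.5, arr.ind = TRUE)
idx
```

The diagonal always matches (every document is identical to itself); the off-diagonal (1,2) and (2,1) entries flag the similar pair, just as the long listing above flags groups of near-duplicate tweets.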
For example, let’s inspect a couple of the tweets:

print(coffee_tweets[[209]])
## [1] "RT @WiltsArtisans: #Glutenfree #Plum &amp; #cinnamon  #slice perfect with #coffee https://t.co/bFIEWFs611"
print(coffee_tweets[[202]])
## [1] "Always getting fresh roasts out to clients within 24 hours. Portland metro area gets delivery next day by me! https://t.co/wQZU2Pos0C"

and we can extract the values of the matrix with

cosine_dist_mat[y]
##    [1] 1.0000000 1.0000000 1.0000000 1.0000000 0.8798005 0.8798005
##    [7] 0.8798005 1.0000000 0.8798005 0.8798005 0.8798005 1.0000000
##   [13] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##   [19] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##   [25] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##   [31] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  ... (remaining values omitted)
##  [ reached getOption("max.print") -- omitted 1744 entries ]

Another way of computing TF-IDF

dtm <- DocumentTermMatrix(coffee_corpus)
dtm_tfxidf <- weightTfIdf(dtm)
inspect(dtm_tfxidf[1:10, 1001:1010])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 1/99
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs #handmade #happiness, #happy #happytuesday #happywednesday
##   1          0           0      0             0               0
##   10         0           0      0             0               0
##   2          0           0      0             0               0
##   3          0           0      0             0               0
##   4          0           0      0             0               0
##   5          0           0      0             0               0
##   6          0           0      0             0               0
##   7          0           0      0             0               0
##   8          0           0      0             0               0
##   9          0           0      0             0               0
##     Terms
## Docs #harleyquinn #hawaii #hazelnuts #headford   #health
##   1             0       0          0         0 0.0000000
##   10            0       0          0         0 0.0000000
##   2             0       0          0         0 0.0000000
##   3             0       0          0         0 0.0000000
##   4             0       0          0         0 0.0000000
##   5             0       0          0         0 0.0000000
##   6             0       0          0         0 0.0000000
##   7             0       0          0         0 0.0000000
##   8             0       0          0         0 0.4647395
##   9             0       0          0         0 0.0000000
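To see what weightTfIdf is doing, it helps to reproduce the computation by hand on a toy corpus. The sketch below assumes tm's default normalized weighting, tf / doc_length * log2(N / df); the two toy documents are made up and are not from our tweet corpus:

```r
# Toy tf-idf by hand: normalized term frequency times log2(N / df)
docs <- list(c("coffee", "coffee", "morning"), c("tea", "morning"))
terms <- sort(unique(unlist(docs)))
# normalized term frequencies, one row per document
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms)) / length(d)))
df <- colSums(tf > 0)                 # document frequency per term
idf <- log2(length(docs) / df)        # idf; a term in every document gets 0
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Note how "morning", which appears in both documents, ends up with a tf-idf of zero: it carries no information for telling the documents apart.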

Clustering

We can now cluster the tweets with k-means on the tf-idf matrix.

m <- as.matrix(dtm_tfxidf)
rownames(m) <- 1:nrow(m)

### don't forget to normalize the vectors so Euclidean makes sense
### (note: a document whose tf-idf row is all zeros would become NaN here)
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)


### cluster into 10 clusters
cl <- kmeans(m_norm, 10)
table(cl$cluster)
## 
##   1   2   3   4   5   6   7   8   9  10 
##  13  15 883   7   9  14  32  10   9   8
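Why normalize before k-means? For unit-length vectors, squared Euclidean distance is a simple function of cosine similarity, so k-means on the normalized rows effectively clusters by cosine. A quick self-contained check with toy vectors (not taken from the corpus):

```r
# For unit vectors u, v:  ||u - v||^2 = 2 * (1 - cos(u, v))
u <- c(1, 2, 0); u <- u / sqrt(sum(u^2))
v <- c(2, 1, 1); v <- v / sqrt(sum(v^2))
eucl_sq <- sum((u - v)^2)
cos_sim <- sum(u * v)
all.equal(eucl_sq, 2 * (1 - cos_sim))  # TRUE
```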
dtm[cl$cluster == 1,]
## <<DocumentTermMatrix (documents: 13, terms: 4784)>>
## Non-/sparse entries: 177/62015
## Sparsity           : 100%
## Maximal term length: 73
## Weighting          : term frequency (tf)
findFreqTerms(dtm[cl$cluster==7,], 1)
##  [1] "@elfortney:"             "@winewankers:"          
##  [3] "#coffee"                 "#coffeelover"           
##  [5] "#wine"                   "#winelover"             
##  [7] "coffee."                 "drink"                  
##  [9] "https://t.co/mz8mg4fi9b" "https://t.co/pqxa5ev8qg"
## [11] "more"                    "prayer!"                
## [13] "problem"                 "quite"                  
## [15] "simple."                 "solution"               
## [17] "solution:"               "sometimes"              
## [19] "the"                     "your"
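Besides findFreqTerms, the k-means centroids themselves summarize each cluster: the terms with the largest centroid weights are what a cluster is "about". A toy sketch on random data (with our tweets one would use cl$centers from above instead of cl_toy$centers):

```r
set.seed(1)
# fake document-term counts: 10 documents, 6 terms
toy <- matrix(rpois(60, lambda = 2), nrow = 10,
              dimnames = list(NULL, paste0("term", 1:6)))
cl_toy <- kmeans(toy, centers = 2, nstart = 5)
# top-3 terms per centroid, one column per cluster
apply(cl_toy$centers, 1, function(ctr)
  names(sort(ctr, decreasing = TRUE))[1:3])
```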
inspect(coffee_corpus[which(cl$cluster==7)])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 32
## 
## $`RT @winewankers: The prayer! #coffee #coffeelover #wine #winelover https://t.co/Mz8mG4FI9b`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 90
## 
## $`RT @elfortney: Sometimes the solution to your problem is quite simple. The solution: drink more coffee. #Coffee https://t.co/Pqxa5Ev8qG`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 135
## 
## $`Sometimes the solution to your problem is quite simple. The solution: drink more coffee. #Coffee https://t.co/Pqxa5Ev8qG`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 120
## 
## [ ... 29 further documents omitted: they are verbatim repeats of the two retweets above ]